Credit Card Customers EDA - Churn Classification ¶

Link for dataset: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers

This dataset covers customers who are leaving a credit card service. Our job is to investigate the data and try to predict who is going to churn, so the bank can proactively contact these customers and prevent the cancellation.

In [1]:
# Importing libraries

import pandas as pd
import plotly.express as px
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import shap
from ydata_profiling import ProfileReport
from matplotlib import gridspec

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

Get and Inspect the Data¶

In [2]:
# Read the data
df = pd.read_csv("BankChurners.csv")
In [3]:
# Seeing all the columns names:
for column_headers in df.columns:
    print(column_headers)
CLIENTNUM
Attrition_Flag
Customer_Age
Gender
Dependent_count
Education_Level
Marital_Status
Income_Category
Card_Category
Months_on_book
Total_Relationship_Count
Months_Inactive_12_mon
Contacts_Count_12_mon
Credit_Limit
Total_Revolving_Bal
Avg_Open_To_Buy
Total_Amt_Chng_Q4_Q1
Total_Trans_Amt
Total_Trans_Ct
Total_Ct_Chng_Q4_Q1
Avg_Utilization_Ratio
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

Description of the dataset:¶

  • CLIENTNUM -> Client number (unique identifier)
  • Attrition_Flag -> Whether the customer closed the account ("Attrited Customer") or kept it ("Existing Customer")
  • Customer_Age -> The age of the customer
  • Gender -> The gender of the customer
  • Dependent_count -> Number of dependents a customer has
  • Education_Level -> Educational qualification of the customer
  • Marital_Status -> Whether a customer is Married, Single, Divorced or Unknown
  • Income_Category -> Annual income category of the customer
  • Card_Category -> Type of card (Blue, Silver, Gold, Platinum)
  • Months_on_book -> Period of relationship with the bank (months)
  • Total_Relationship_Count -> Total number of products held by the customer
  • Months_Inactive_12_mon -> Number of months inactive in the last 12 months
  • Contacts_Count_12_mon -> Number of contacts in the last 12 months
  • Credit_Limit -> Credit limit on the credit card
  • Total_Revolving_Bal -> Total revolving balance on the credit card
  • Avg_Open_To_Buy -> Open-to-buy credit line (average of last 12 months)
  • Total_Amt_Chng_Q4_Q1 -> Change in transaction amount (Q4 over Q1)
  • Total_Trans_Amt -> Total transaction amount (last 12 months)
  • Total_Trans_Ct -> Total transaction count (last 12 months)
  • Total_Ct_Chng_Q4_Q1 -> Change in transaction count (Q4 over Q1)
  • Avg_Utilization_Ratio -> Average card utilization ratio
In [4]:
df.head(3)
Out[4]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 12691.0 777 11914.0 1.335 1144 42 1.625 0.061 0.000093 0.99991
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 8256.0 864 7392.0 1.541 1291 33 3.714 0.105 0.000057 0.99994
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 3418.0 0 3418.0 2.594 1887 20 2.333 0.000 0.000021 0.99998

3 rows × 23 columns

We will drop the CLIENTNUM column (an ID with no predictive value) and the last two Naive Bayes classifier columns

In [5]:
columns = ["CLIENTNUM","Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
           "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"]
In [6]:
df.drop(columns=columns, inplace=True)
In [7]:
df.head(3)
Out[7]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
In [8]:
# Seeing the data types
df.dtypes
Out[8]:
Attrition_Flag               object
Customer_Age                  int64
Gender                       object
Dependent_count               int64
Education_Level              object
Marital_Status               object
Income_Category              object
Card_Category                object
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object
In [9]:
# Creating a profile for automated report
profile = ProfileReport(df, title="Credit_Card_Churn_Report")
In [10]:
# Exporting to a file
profile.to_file("Credit_Card_Churn_Report.html")
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
In [11]:
# Checking null values
df.isnull().sum()
Out[11]:
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

None of the features has missing values

In [12]:
# Generating a descriptive statistics about the data
df.describe().T
Out[12]:
count mean std min 25% 50% 75% max
Customer_Age 10127.0 46.325960 8.016814 26.0 41.000 46.000 52.000 73.000
Dependent_count 10127.0 2.346203 1.298908 0.0 1.000 2.000 3.000 5.000
Months_on_book 10127.0 35.928409 7.986416 13.0 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.0 3.812580 1.554408 1.0 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.0 2.341167 1.010622 0.0 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.0 2.455317 1.106225 0.0 2.000 2.000 3.000 6.000
Credit_Limit 10127.0 8631.953698 9088.776650 1438.3 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.0 1162.814061 814.987335 0.0 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.0 7469.139637 9090.685324 3.0 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.0 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 4404.086304 3397.129254 510.0 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.0 64.858695 23.472570 10.0 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.0 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999

Most features have a mean close to the median, but for "Credit_Limit" and "Avg_Open_To_Buy" the mean differs noticeably from the median, which suggests right-skewed distributions. We will investigate this behavior through plots
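A gap between mean and median is a classic sign of skew. The sketch below uses made-up synthetic data (not the bank dataset) to show why a mean above the median indicates a right-skewed distribution:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Right-skewed data: a long tail of large values drags the mean above the median
skewed = pd.Series(rng.exponential(scale=8000, size=10_000))
# Symmetric data: mean and median roughly coincide
symmetric = pd.Series(rng.normal(loc=8000, scale=1000, size=10_000))

print(skewed.mean() > skewed.median())                    # True for right skew
print(abs(symmetric.mean() - symmetric.median()) < 200)   # True when symmetric
```

The same comparison can be read directly off the `mean` and `50%` columns of `df.describe().T`.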

In [13]:
labels = df.Attrition_Flag.value_counts().index
sizes = df.Attrition_Flag.value_counts()
explode=(0, 0.1)
fig1, ax1 = plt.subplots(figsize=(12,6))
ax1.pie(sizes, explode=explode, labels=labels, autopct="%1.1f%%",
       shadow=True, startangle=90)
plt.title("Proportion of customers", size=20)
Out[13]:
Text(0.5, 1.0, 'Proportion of customers')

The majority of the data (83.9%) refers to Existing Customers (customers who still hold an account with the bank), while only 16.1% refers to customers who cancelled their accounts — a clear class imbalance
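The proportions in the pie chart can also be read directly with `value_counts(normalize=True)`; a minimal sketch on a toy flag column (made-up counts standing in for `df["Attrition_Flag"]`):

```python
import pandas as pd

# Toy attrition flags: 84 existing vs 16 attrited customers
flags = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)

# normalize=True turns raw counts into class proportions
proportions = flags.value_counts(normalize=True)
print(proportions["Existing Customer"])  # 0.84
print(proportions["Attrited Customer"])  # 0.16
```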

In [14]:
fig = px.histogram(df,
             "Attrition_Flag",
             color="Attrition_Flag",
             hover_name="Attrition_Flag",
)

fig.show()

"Education_Level"¶

In [15]:
fig = px.histogram(df,
             "Education_Level",
             color="Attrition_Flag",
             hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")

fig.show()

Most customers hold a Graduate degree and the fewest hold a Doctorate. Attrited customers are present at every education level, with more of them in the Graduate group than in the Doctorate group

"Gender"¶

In [16]:
fig = px.histogram(df, 
             x="Gender", 
             color="Attrition_Flag",
             hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")

fig.show()

There are more females than males in the dataset, and women outnumber men among both existing and attrited customers

"Card_Category"¶

In [17]:
fig = px.histogram(df, 
             x="Card_Category", 
             color="Attrition_Flag",
             hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")

fig.show()

The vast majority of customers hold the Blue card, and only a few hold Platinum. However, attrited customers are present in all four card categories

"Dependent_Count"¶

In [18]:
px.histogram(df, 
             x="Dependent_count", 
             color="Attrition_Flag",
             hover_name="Attrition_Flag",
)

Most customers have 2 or 3 dependents, and this range also accounts for the largest number of customers who cancelled their accounts

"Customer_Age"¶

In [19]:
sns.displot(df,
            x="Customer_Age",
            kind="kde",
            hue="Attrition_Flag"
)
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x163b997b5e0>

From the plot we can see that there are more existing customers than attrited ones. In addition, both groups follow a similar age distribution, peaking between 40 and 50 years old

"Credit_Limit"¶

In [20]:
sns.displot(df,
            x="Credit_Limit",
            kind="kde",
            hue="Attrition_Flag"
)
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x163b7c1d4c0>

Existing and attrited customers show the same behavior pattern: the density peaks are concentrated near the extreme values of the credit limit

"Avg_Utilization_Ratio"¶

In [21]:
sns.displot(df,
            x="Avg_Utilization_Ratio",
            kind="kde",
            hue="Attrition_Flag"
)
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x163b7c3e310>
In [22]:
# Select numerical features and exclude categorical ones
num_df = df.select_dtypes(exclude=["object"])
num_df.head(3)
Out[22]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 45 3 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 49 5 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 51 3 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
In [23]:
# Check Pearson's correlation of the numerical features
corr_matrix = num_df.corr()

fig, ax = plt.subplots(figsize=(12,8))

heatmap = sns.heatmap(
    corr_matrix,
    cmap="Wistia",
    annot=True,
    fmt=".2f"
)

Creating a Model¶

Our target will be the "Attrition_Flag" column, so we drop it from the feature matrix

In [24]:
X = df.drop(columns=["Attrition_Flag"])
X.head(3)
Out[24]:
Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000

Now we will split our X dataframe into categorical and numerical columns. After that, we will one-hot encode the categorical columns with the "pd.get_dummies" method, replacing each string category with binary indicator columns
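A minimal sketch of what `pd.get_dummies` does, on a toy column with made-up values:

```python
import pandas as pd

# Toy categorical column standing in for one of our features
toy = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Blue", "Silver"]})

# One binary indicator column per category level, prefixed with the column name
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['Card_Category_Blue', 'Card_Category_Gold', 'Card_Category_Silver']
```

Note that `get_dummies` also accepts `drop_first=True` to drop one level per feature, which avoids redundant (perfectly collinear) columns for linear models; tree-based models like the ones used below are not sensitive to this.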

In [25]:
X_cat = X.select_dtypes(include=["object", "bool"]).columns
X_cat
Out[25]:
Index(['Gender', 'Education_Level', 'Marital_Status', 'Income_Category',
       'Card_Category'],
      dtype='object')
In [26]:
X_num = X.select_dtypes(include=["int64", "float64"]).columns
X_num
Out[26]:
Index(['Customer_Age', 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
In [27]:
# Numerical DataFrame
X_num_df = X[X_num]
X_num_df.head(3)
Out[27]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 45 3 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 49 5 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 51 3 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
In [28]:
# Categorical DataFrame
X_cat_df = X[X_cat]
X_cat_df.head(3)
Out[28]:
Gender Education_Level Marital_Status Income_Category Card_Category
0 M High School Married $60K - $80K Blue
1 F Graduate Single Less than $40K Blue
2 M Graduate Married $80K - $120K Blue
In [29]:
# One Hot Encoding approach
X_cat_one_hot_encoded = pd.get_dummies(X_cat_df)
X_cat_one_hot_encoded.head(3)
Out[29]:
Gender_F Gender_M Education_Level_College Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Education_Level_Unknown Marital_Status_Divorced ... Income_Category_$120K + Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_Unknown Card_Category_Blue Card_Category_Gold Card_Category_Platinum Card_Category_Silver
0 0 1 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0
1 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 1 0 0 0
2 0 1 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 0 1 0 0 0

3 rows × 23 columns

One-hot encoding creates a binary column for every category level, replacing the string values with 0/1 indicators

In [30]:
# Recreating the X Dataframe
X_encoded = pd.concat([X_num_df, X_cat_one_hot_encoded], axis=1, join="inner")
X_encoded.head(3)
Out[30]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 ... Income_Category_$120K + Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_Unknown Card_Category_Blue Card_Category_Gold Card_Category_Platinum Card_Category_Silver
0 45 3 39 5 1 3 12691.0 777 11914.0 1.335 ... 0 0 1 0 0 0 1 0 0 0
1 49 5 44 6 1 2 8256.0 864 7392.0 1.541 ... 0 0 0 0 1 0 1 0 0 0
2 51 3 36 4 1 0 3418.0 0 3418.0 2.594 ... 0 0 0 1 0 0 1 0 0 0

3 rows × 37 columns

In [31]:
# Target Feature
y = df["Attrition_Flag"]
y.head(3)
Out[31]:
0    Existing Customer
1    Existing Customer
2    Existing Customer
Name: Attrition_Flag, dtype: object
In [32]:
# Now transforming our target feature:
# 0 means the customer kept the bank account
# 1 means the customer cancelled the bank account
df_transformado = df.copy()  # copy so the original DataFrame is not modified
df_transformado["Churn"] = np.where(df_transformado["Attrition_Flag"] == "Attrited Customer", 1, 0)

y = df_transformado["Churn"]
y.head(3)
Out[32]:
0    0
1    0
2    0
Name: Churn, dtype: int32
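The same 0/1 mapping can be written with `np.where` (as above) or with `Series.map`; a sketch on toy labels (made-up values):

```python
import numpy as np
import pandas as pd

labels = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

churn_where = np.where(labels == "Attrited Customer", 1, 0)
churn_map = labels.map({"Existing Customer": 0, "Attrited Customer": 1})

print(churn_where.tolist())  # [0, 1, 0]
print(churn_map.tolist())    # [0, 1, 0]
```

One design trade-off: `np.where` silently maps any unexpected label to 0, while `.map` produces NaN for labels missing from the dictionary, which makes typos easier to spot.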

Importing the models¶

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import lightgbm as lgb
In [34]:
from tqdm import tqdm

Now we will obtain metrics for three different kinds of models, repeating each train/test split 100 times so that we get a distribution of scores rather than a single point estimate
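The three loops below share the same structure, so they could be factored into one helper; a sketch of that refactoring (the `evaluate_model` name and its defaults are choices of this sketch, not part of the original notebook):

```python
from sklearn import metrics
from sklearn.model_selection import train_test_split

def evaluate_model(model_factory, X, y, n_runs=100, test_size=0.33):
    """Repeatedly split, fit a fresh model, and collect metric distributions."""
    scores = {"precision": [], "accuracy": [], "recall": [], "f1": []}
    for _ in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y)
        model = model_factory()  # fresh, untrained model each run
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores["precision"].append(metrics.precision_score(y_test, y_pred))
        scores["accuracy"].append(metrics.accuracy_score(y_test, y_pred))
        scores["recall"].append(metrics.recall_score(y_test, y_pred))
        scores["f1"].append(metrics.f1_score(y_test, y_pred))
    return scores
```

Usage would then be e.g. `tree_scores = evaluate_model(DecisionTreeClassifier, X_encoded, y)`, one line per model instead of one loop per model.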

Decision Tree Classifier¶

In [35]:
# Pre-allocating metric lists
tree_precision = []
tree_accuracy = []
tree_recall = []
tree_f1 = []

for i in tqdm(range(100)):
    X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
                                                        test_size=0.33,
                                                        stratify=y)
    tree_classifier = DecisionTreeClassifier()
    tree_classifier.fit(X_train, y_train)
    # Metrics (predict once, reuse for all scores)
    y_pred = tree_classifier.predict(X_test)
    precision = metrics.precision_score(y_test, y_pred)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # Append values
    tree_precision.append(precision)
    tree_accuracy.append(accuracy)
    tree_recall.append(recall)
    tree_f1.append(f1)
100%|██████████| 100/100 [00:18<00:00,  5.28it/s]

Random Forest Classifier¶

In [36]:
# Pre-allocating metric lists
forest_precision = []
forest_accuracy = []
forest_recall = []
forest_f1 = []

for i in tqdm(range(100)):
    X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
                                                        test_size=0.33,
                                                        stratify=y)
    forest_classifier = RandomForestClassifier()
    forest_classifier.fit(X_train, y_train)
    # Metrics (predict once, reuse for all scores)
    y_pred = forest_classifier.predict(X_test)
    precision = metrics.precision_score(y_test, y_pred)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # Append values
    forest_precision.append(precision)
    forest_accuracy.append(accuracy)
    forest_recall.append(recall)
    forest_f1.append(f1)
100%|██████████| 100/100 [02:35<00:00,  1.55s/it]

LGBM Classifier¶

In [37]:
# Pre-allocating metric lists
lgb_precision = []
lgb_accuracy = []
lgb_recall = []
lgb_f1 = []

for i in tqdm(range(100)):
    X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
                                                        test_size=0.33,
                                                        stratify=y)
    lgb_classifier = lgb.LGBMClassifier()
    lgb_classifier.fit(X_train, y_train)
    # Metrics (predict once, reuse for all scores)
    y_pred = lgb_classifier.predict(X_test)
    precision = metrics.precision_score(y_test, y_pred)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # Append values
    lgb_precision.append(precision)
    lgb_accuracy.append(accuracy)
    lgb_recall.append(recall)
    lgb_f1.append(f1)
100%|██████████| 100/100 [00:31<00:00,  3.19it/s]
In [38]:
# Preparing the data results to plot
precision_comparison = pd.DataFrame({"DecisionTree":tree_precision, 
                                     "RandomForest":forest_precision,
                                     "LGBM":lgb_precision})

accuracy_comparison = pd.DataFrame({"DecisionTree":tree_accuracy, 
                                    "RandomForest":forest_accuracy,
                                    "LGBM":lgb_accuracy})

recall_comparison = pd.DataFrame({"DecisionTree":tree_recall, 
                                  "RandomForest":forest_recall,
                                  "LGBM":lgb_recall})

f1_comparison = pd.DataFrame({"DecisionTree":tree_f1, 
                              "RandomForest":forest_f1,
                              "LGBM":lgb_f1})

Precision Score¶

Precision measures how precise the model's positive predictions are: of all samples predicted positive, it is the fraction that is actually positive, TP / (TP + FP)
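A quick check of the definition on toy predictions (made-up labels, not the churn data):

```python
from sklearn import metrics

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]

# Predicted positives at indices 0, 2, 4 -> three of them, two truly positive
# so TP = 2, FP = 1 and precision = 2 / 3
print(metrics.precision_score(y_true, y_pred))  # 0.666...
```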

In [39]:
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=precision_comparison).set(title="Precision Score")
Out[39]:
[Text(0.5, 1.0, 'Precision Score')]

Accuracy Score¶

Accuracy describes the overall performance of the model across all classes: the fraction of all predictions the model got right. With an imbalanced target like ours, it can look high even when the minority (attrited) class is poorly predicted

In [40]:
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=accuracy_comparison).set(title="Accuracy Score")
Out[40]:
[Text(0.5, 1.0, 'Accuracy Score')]

Recall Score¶

Recall quantifies how many of the actual positives the model finds: of all truly positive samples, the fraction correctly predicted positive, TP / (TP + FN). For churn prediction it is especially important, since every missed churner is a customer we never contact
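Again a quick check on toy labels (made-up values):

```python
from sklearn import metrics

# Three actual positives, but only one of them is caught by the model
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0]

# TP = 1, FN = 2 so recall = 1 / 3
print(metrics.recall_score(y_true, y_pred))  # 0.333...
```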

In [41]:
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=recall_comparison).set(title="Recall Score")
Out[41]:
[Text(0.5, 1.0, 'Recall Score')]

F1 Score¶

The F1 score gives us a single number that balances precision and recall: it is their harmonic mean
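The harmonic-mean identity can be verified directly on toy labels (made-up values):

```python
from sklearn import metrics

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

p = metrics.precision_score(y_true, y_pred)  # TP = 3, FP = 1 -> 3/4
r = metrics.recall_score(y_true, y_pred)     # TP = 3, FN = 1 -> 3/4
f1 = metrics.f1_score(y_true, y_pred)

# F1 equals the harmonic mean 2pr / (p + r)
print(abs(f1 - 2 * p * r / (p + r)) < 1e-12)  # True
```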

In [42]:
fig = plt.figure(figsize = (12,6))
b = sns.boxplot(data=f1_comparison).set(title="F1 Score")

The plots, especially "Precision" and "F1 Score", show that the LGBM classifier is the best of the three for our data

Using "Shap" library to evaluate the features¶

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
                                                    test_size=0.33,
                                                    stratify=y)
In [44]:
lgb_classifier = lgb.LGBMClassifier()
In [45]:
lgb_classifier.fit(X_train, y_train)
Out[45]:
LGBMClassifier()
In [46]:
explainer_lgbm = shap.Explainer(lgb_classifier.predict, X_test)
In [47]:
shap_values_lgbm = explainer_lgbm(X_test)
Permutation explainer: 3343it [09:16,  5.93it/s]                          
In [48]:
shap.plots.waterfall(shap_values_lgbm[0])
In [49]:
shap.plots.beeswarm(shap_values_lgbm)